The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximize earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones that exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import pdist
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to perform PCA
from sklearn.decomposition import PCA
# to compute distances
from scipy.spatial.distance import cdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv("stock_data.csv") # reading csv and loading into datframe
df_bk = df.copy()
df.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
df.tail()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 939457327.6 | 28.976191 | 6.261775 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 435353535.4 | 17.682214 | -3.838260 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 188461538.5 | 131.525636 | -23.884449 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 257892500.0 | 22.749999 | -0.063096 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 498529411.8 | 70.470585 | 1.723068 |
print("Shape of the dataset: ", df.shape)
Shape of the dataset: (340, 15)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
print("Missing value check in dataset")
print(df.isnull().sum())
Missing value check in dataset Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
print("Duplicate row counts in dataset")
print(df.duplicated().sum())
Duplicate row counts in dataset 0
Questions:
pd.options.display.float_format = "{:.5f}".format
df.describe().T # summary of variables, transpose for better readability
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.00000 | 80.86234 | 98.05509 | 4.50000 | 38.55500 | 59.70500 | 92.88000 | 1274.94995 |
| Price Change | 340.00000 | 4.07819 | 12.00634 | -47.12969 | -0.93948 | 4.81950 | 10.69549 | 55.05168 |
| Volatility | 340.00000 | 1.52598 | 0.59180 | 0.73316 | 1.13488 | 1.38559 | 1.69555 | 4.58004 |
| ROE | 340.00000 | 39.59706 | 96.54754 | 1.00000 | 9.75000 | 15.00000 | 27.00000 | 917.00000 |
| Cash Ratio | 340.00000 | 70.02353 | 90.42133 | 0.00000 | 18.00000 | 47.00000 | 99.00000 | 958.00000 |
| Net Cash Flow | 340.00000 | 55537620.58824 | 1946365312.17579 | -11208000000.00000 | -193906500.00000 | 2098000.00000 | 169810750.00000 | 20764000000.00000 |
| Net Income | 340.00000 | 1494384602.94118 | 3940150279.32794 | -23528000000.00000 | 352301250.00000 | 707336000.00000 | 1899000000.00000 | 24442000000.00000 |
| Earnings Per Share | 340.00000 | 2.77666 | 6.58778 | -61.20000 | 1.55750 | 2.89500 | 4.62000 | 50.09000 |
| Estimated Shares Outstanding | 340.00000 | 577028337.75403 | 845849595.41769 | 27672156.86000 | 158848216.10000 | 309675137.80000 | 573117457.32500 | 6159292035.00000 |
| P/E Ratio | 340.00000 | 32.61256 | 44.34873 | 2.93545 | 15.04465 | 20.81988 | 31.76476 | 528.03907 |
| P/B Ratio | 340.00000 | -1.71825 | 13.96691 | -76.11908 | -4.35206 | -1.06717 | 3.91707 | 129.06459 |
Current Price - Data is ranged from ~\$4.5 to ~\\$1275, however majority (75%) of data have current price less than ~\$92.80, with overall average price less than ~\\$80.86Price Change - Majority of observations are under ~10% change in past 13 weeks, however we can see we have observations with change as high as ~55%, with a minimum as ~4%Volatility - We can observe most of the stocks have the volatility ranged from 1.38 to 1.69, with an average of 1.52, however there are some high volatilit stocks as higher range is as high as 4.58ROE We can observe there are stocks with high ROE as 917, where as most stocks are having average ROE as ~39Cash Ratio We can see we have stocks with high cash ratio as high as 958, where as most stocks are with average cash ratio of ~70 with a minimum of 0Net Cash Flow - Data looks very skewed with high difference in mean and median, also the variance is very high.Net Income - Data is wide ranged and skewed with mean and median has significant difference, also data has high variance with min high in negative and max with high positive amount.Earnings Per Share - Minimum earning per shares shows in negative, which tells us there are companies in loss, also data has high variance with max as ~50, however the mean and median is close to each other.Estimated Shares Outstanding - Data has high variance and looks skewed. P/E Ratio - Data looks skewed with a high variance, min value is ~3 and max value shows ~528P/B Ratio - Almost 50% of data shows in negative P/B ratio which tells us they are trading at a lower price with respect to their assets and market value.df.describe(include="object").T
| count | unique | top | freq | |
|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | AAL | 1 |
| Security | 340 | 340 | American Airlines Group | 1 |
| GICS Sector | 340 | 11 | Industrials | 53 |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 |
# Below is a function which will take input as numerical attributes,
# titlevar to be able to dynamically populate the title for each plot
# figure size (to have the flexibility to have custom figsize if necessary), bins(for histogram) and
# returns boxplot and histogram plot for the numerical attribute passed.
def box_hist(feature, titlevar, figsize=(15, 10), bins=None):
sns.set(style="ticks")
sns.set_style("darkgrid")
fig, (axb, axh) = plt.subplots(
nrows=2, # Number of rows to be plotted for subplot
sharex=True, # x-axis will be shared for all subplots
gridspec_kw={
"height_ratios": (0.25, 0.75)
}, # pass on the grid spaces to Gridspace constructor for subplots
figsize=figsize,
)
sns.boxplot(
feature,
ax=axb,
showmeans=True,
color="#00FFA0",
meanprops={"markerfacecolor": "white", "markeredgecolor": "black"},
) # plots the boxplot and show a marker for mean value
axb.set_title("BoxPlot for: {}".format(titlevar), fontweight="bold", fontsize=18)
sns.distplot(
feature, ax=axh, kde=True, bins=bins, color="#004DFD"
) if bins else sns.distplot(
feature, ax=axh, kde=True, color="#004DFD"
) # plots histogram
axh.axvline(
np.mean(feature), linestyle="-", color="red"
) # Add a red vertical line for mean value of the feature
axh.axvline(
np.median(feature), linestyle="--", color="black"
) # Add a green vertical line for median value of the feature
plt.ylabel("Count")
axh.set_title(
"Distribution Plot for: {}".format(titlevar), fontweight="bold", fontsize=18
)
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
sns.set(style="ticks")
sns.set_style("darkgrid")
total = len(data[feature]) # length of the column
count = data[feature].nunique()
plt.figure(figsize=(20, 12))
plt.xticks(rotation=45, fontsize=12)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
ax.set_title(
"Proportion based on: {}".format(feature), fontweight="bold", fontsize=14,
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Creating list of columns which are numeric in nature
numeric_columns = df.select_dtypes(include=("int", "float")).columns
numeric_columns
Index(['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'],
dtype='object')
# Write a loop which will plot boxplot and histogram for all variables in dependent_vars list
for col in numeric_columns:
box_hist(df[col], col)
Current PricePrice ChangeVolatilityROECash RatioNet Cash FlowNet IncomeEarnings Per ShareEstimated Shares OutstandingP/E RatioP/B Ratiolabeled_barplot(df, "GICS Sector", perc=True)
fig, ax = plt.subplots(figsize=(15, 45)) # figsize and axis declaration
sns.set_style("darkgrid") # seaborn background set to dark
ax = sns.countplot(
y="GICS Sub Industry",
data=df,
orient="v",
palette="Paired",
order=df["GICS Sub Industry"].value_counts().index,
) # plotting the countplot
plt.show()
# plotiing pairplot for all the variables to observe any patterns with target variable
sns.set(style="ticks")
sns.set_style("darkgrid")
plt.figure(figsize=(20, 12))
sns.pairplot(df)
plt.show()
<Figure size 1440x864 with 0 Axes>
plt.figure(figsize=(20, 12))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral", fmt=".2f")
plt.show()
# plots boxplots for price change by GICS Sector
fig, ax = plt.subplots(figsize=(15, 8))
ax = sns.boxplot(df["Price Change"], df["GICS Sector"], orient="h", palette="winter_r",)
plt.title("Average Price Change Per Economic Sector", fontsize=16, fontweight="bold")
plt.show()
# plots boxplots for cash ratio by each economic sector
fig, ax = plt.subplots(figsize=(15, 8))
ax = sns.boxplot(df["Cash Ratio"], df["GICS Sector"], orient="h", palette="winter_r",)
plt.title("Average Cash Ratio Per Economic Sector", fontsize=16, fontweight="bold")
plt.show()
# plots boxplots for P/E ratio by each economic sector
fig, ax = plt.subplots(figsize=(15, 8))
ax = sns.boxplot(df["P/E Ratio"], df["GICS Sector"], orient="h", palette="winter_r",)
plt.title("Average P/E Ratio Per Economic Sector", fontsize=16, fontweight="bold")
plt.show()
# plots boxplots for estimated shares outstanding by each economic sector
fig, ax = plt.subplots(figsize=(15, 8))
ax = sns.boxplot(
df["Estimated Shares Outstanding"],
df["GICS Sector"],
orient="h",
palette="winter_r",
)
plt.title(
"Average Estimated Shares Outstanding Per Economic Sector",
fontsize=16,
fontweight="bold",
)
plt.show()
Duplicate value check
Missing value treatment
for col in numeric_columns:
print(f"Unique values in numeric column {col}")
print(df[col].value_counts())
print("*" * 50)
Unique values in numeric column Current Price
54.36000 2
114.38000 2
43.23000 2
32.93000 2
103.89000 1
..
51.07000 1
52.13000 1
81.59000 1
302.39999 1
47.92000 1
Name: Current Price, Length: 336, dtype: int64
**************************************************
Unique values in numeric column Price Change
9.99999 1
6.23785 1
6.58555 1
5.79688 1
9.52997 1
..
8.03760 1
10.01965 1
-4.07859 1
3.65424 1
16.67884 1
Name: Price Change, Length: 340, dtype: int64
**************************************************
Unique values in numeric column Volatility
1.68715 1
1.02338 1
1.46062 1
1.28157 1
2.16881 1
..
1.05619 1
1.30808 1
1.94110 1
1.40451 1
1.61028 1
Name: Volatility, Length: 340, dtype: int64
**************************************************
Unique values in numeric column ROE
10 24
9 16
14 16
11 16
17 15
8 14
7 13
13 12
19 10
16 9
6 9
24 9
4 9
12 9
15 9
5 9
26 7
20 7
2 7
22 6
25 6
29 6
18 6
23 5
3 5
30 5
35 5
27 4
28 4
64 3
21 3
1 3
228 2
32 2
36 2
45 2
34 2
43 2
42 2
52 2
82 2
174 1
200 1
61 1
244 1
73 1
86 1
60 1
47 1
103 1
121 1
109 1
40 1
59 1
48 1
37 1
196 1
596 1
182 1
135 1
167 1
263 1
601 1
917 1
38 1
687 1
44 1
589 1
463 1
205 1
130 1
155 1
98 1
41 1
51 1
92 1
33 1
582 1
116 1
68 1
63 1
142 1
Name: ROE, dtype: int64
**************************************************
Unique values in numeric column Cash Ratio
99 37
47 19
25 10
70 8
10 7
27 7
2 7
9 7
3 7
22 6
18 6
11 6
8 5
117 5
6 5
31 5
14 5
67 5
12 4
74 4
39 4
13 4
1 4
43 4
4 4
0 4
16 3
20 3
41 3
29 3
38 3
26 3
5 3
81 3
15 3
37 3
7 3
61 3
49 3
130 2
116 2
35 2
19 2
129 2
34 2
21 2
65 2
190 2
23 2
198 2
46 2
33 2
30 2
44 2
42 2
108 2
54 2
51 2
45 2
36 2
225 2
163 2
24 2
79 2
60 2
58 2
131 1
64 1
147 1
195 1
201 1
362 1
110 1
57 1
71 1
62 1
80 1
68 1
184 1
136 1
272 1
180 1
88 1
212 1
115 1
126 1
56 1
127 1
221 1
425 1
83 1
459 1
568 1
260 1
17 1
183 1
257 1
94 1
77 1
271 1
958 1
52 1
73 1
182 1
496 1
148 1
48 1
121 1
237 1
333 1
189 1
92 1
103 1
40 1
162 1
317 1
53 1
133 1
84 1
82 1
128 1
175 1
164 1
100 1
Name: Cash Ratio, dtype: int64
**************************************************
Unique values in numeric column Net Cash Flow
159000000 2
39000000 2
160383000 2
-8000000 2
23000000 2
..
4073000 1
-14756000 1
523803000 1
2196000 1
-43623000 1
Name: Net Cash Flow, Length: 332, dtype: int64
**************************************************
Unique values in numeric column Net Income
1899000000 2
355000000 2
848000000 2
1034000000 2
636056000 2
..
268000000 1
232120000 1
878485000 1
85171000 1
339000000 1
Name: Net Income, Length: 335, dtype: int64
**************************************************
Unique values in numeric column Earnings Per Share
3.28000 5
3.71000 4
3.03000 4
1.80000 4
1.58000 4
..
5.87000 1
1.86000 1
1.77000 1
1.31000 1
1.20000 1
Name: Earnings Per Share, Length: 268, dtype: int64
**************************************************
Unique values in numeric column Estimated Shares Outstanding
199237804.90000 4
513987730.10000 3
111522633.70000 3
1469541779.00000 2
1633015873.00000 2
..
367139240.50000 1
57776615.38000 1
545779855.20000 1
148511383.50000 1
498529411.80000 1
Name: Estimated Shares Outstanding, Length: 323, dtype: int64
**************************************************
Unique values in numeric column P/E Ratio
93.08929 17
20.81988 6
24.07012 5
28.40793 3
17.31308 3
..
18.63636 1
21.17547 1
55.82911 1
7.57527 1
70.47059 1
Name: P/E Ratio, Length: 295, dtype: int64
**************************************************
Unique values in numeric column P/B Ratio
-1.26933 6
-3.08948 5
-3.97340 4
3.95497 4
-4.32714 4
..
-14.15145 1
5.10876 1
4.42743 1
5.88447 1
1.72307 1
Name: P/B Ratio, Length: 299, dtype: int64
**************************************************
df_outrm = df.copy()
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(15, 10))
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
plt.boxplot(df_outrm[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
def treat_outliers(df, col):
"""
treats outliers in a variable
col: str, name of the numerical variable
df: dataframe
col: name of the column
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
treat outlier in all numerical variables
col_list: list of numerical variables
df: data frame
"""
for c in col_list:
df = treat_outliers(df, c)
return df
# treating the outliers except ram
df_outrm = treat_outliers_all(df_outrm, numeric_columns)
df_outrm
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.35000 | 9.99999 | 1.68715 | 52.87500 | 51.00000 | -604000000.00000 | 4219048125.00000 | 9.21375 | 668129938.50000 | 3.71817 | -8.78422 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.24000 | 8.33943 | 2.19789 | 52.87500 | 77.00000 | 51000000.00000 | 4219048125.00000 | 3.15000 | 1194521319.16250 | 18.80635 | -8.75007 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.91000 | 11.30112 | 1.27365 | 21.00000 | 67.00000 | 715386625.00000 | 4219048125.00000 | 2.94000 | 1194521319.16250 | 15.27551 | -0.39417 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.94000 | 13.97720 | 1.35768 | 9.00000 | 180.00000 | -240840000.00000 | 629551000.00000 | 1.26000 | 499643650.80000 | 56.84491 | 4.19965 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.32000 | -1.82786 | 1.70117 | 14.00000 | 220.50000 | 315120000.00000 | 696878000.00000 | 0.31000 | 1194521319.16250 | 56.84491 | 1.05981 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.26000 | 14.88773 | 1.84515 | 15.00000 | 220.50000 | -739482375.00000 | -1967746875.00000 | -3.03625 | 939457327.60000 | 28.97619 | 6.26177 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.51618 | -8.69892 | 1.47888 | 52.87500 | 27.00000 | 159000000.00000 | 1293000000.00000 | 2.97000 | 435353535.40000 | 17.68221 | -3.83826 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.59000 | 9.34768 | 1.40421 | 1.00000 | 100.00000 | 376000000.00000 | 147000000.00000 | 0.78000 | 188461538.50000 | 56.84491 | -16.75574 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.30000 | -1.15859 | 1.46818 | 4.00000 | 99.00000 | -43623000.00000 | 309471000.00000 | 1.20000 | 257892500.00000 | 22.75000 | -0.06310 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.92000 | 16.67884 | 1.61028 | 32.00000 | 65.00000 | 272000000.00000 | 339000000.00000 | 0.68000 | 498529411.80000 | 56.84491 | 1.72307 |
340 rows × 15 columns
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(15, 10))
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
plt.boxplot(df_outrm[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# scaling the dataset before clustering
scaler = StandardScaler()
subset = df[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_df
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.39334 | 0.49395 | 0.27275 | 0.98960 | -0.21070 | -0.33936 | 1.55442 | 1.30940 | 0.10786 | -0.65249 | -0.50665 |
| 1 | -0.22084 | 0.35544 | 1.13704 | 0.93774 | 0.07727 | -0.00233 | 0.92763 | 0.05675 | 1.25027 | -0.31177 | -0.50420 |
| 2 | -0.36719 | 0.60248 | -0.42701 | -0.19290 | -0.03349 | 0.45406 | 0.74437 | 0.02483 | 1.09802 | -0.39150 | 0.09494 |
| 3 | 0.13357 | 0.82570 | -0.28480 | -0.31738 | 1.21806 | -0.15250 | -0.21982 | -0.23056 | -0.09162 | 0.94715 | 0.42433 |
| 4 | -0.26087 | -0.49264 | 0.29647 | -0.26551 | 2.23702 | 0.13356 | -0.20270 | -0.37498 | 1.97840 | 3.29331 | 0.19920 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | -0.48618 | 0.90165 | 0.54012 | -0.25514 | 4.30816 | -0.55967 | -1.48778 | -1.12748 | 0.42911 | -0.08212 | 0.57219 |
| 336 | -0.28951 | -1.06577 | -0.07970 | 1.06221 | -0.47651 | 0.05324 | -0.05119 | 0.02939 | -0.16774 | -0.33715 | -0.15201 |
| 337 | 0.22191 | 0.43954 | -0.20607 | -0.40036 | 0.33201 | 0.16489 | -0.34247 | -0.30353 | -0.46006 | 2.23363 | -1.58939 |
| 338 | -0.54705 | -0.43681 | -0.09781 | -0.36924 | 0.32093 | -0.05102 | -0.30117 | -0.23968 | -0.37785 | -0.22271 | 0.11868 |
| 339 | -0.33645 | 1.05105 | 0.14267 | -0.07880 | -0.05564 | 0.11138 | -0.29367 | -0.31873 | -0.09294 | 0.85490 | 0.24675 |
340 rows × 11 columns
# scaling the dataset where we have treated outliers
subset_outrm = df_outrm[numeric_columns].copy()
subset_scaled_outrm = scaler.fit_transform(subset_outrm)
# Creating a dataframe from the scaled data
subset_scaled_df_outrm = pd.DataFrame(subset_scaled_outrm, columns=subset_outrm.columns)
subset_scaled_df_outrm
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.65071 | 0.54929 | 0.44821 | 2.11683 | -0.20731 | -1.43376 | 1.97714 | 2.11490 | 0.64505 | -1.49377 | -0.99734 |
| 1 | -0.25796 | 0.38791 | 1.56884 | 2.11683 | 0.23858 | 0.13301 | 1.97714 | 0.01798 | 2.12480 | -0.46214 | -0.99284 |
| 2 | -0.59118 | 0.67573 | -0.45908 | 0.03153 | 0.06709 | 1.72223 | 1.97714 | -0.05464 | 2.12480 | -0.70355 | 0.10897 |
| 3 | 0.54893 | 0.93580 | -0.27470 | -0.75352 | 2.00502 | -0.56508 | -0.37174 | -0.63560 | 0.17142 | 2.13868 | 0.71471 |
| 4 | -0.34912 | -0.60018 | 0.47897 | -0.42642 | 2.69958 | 0.76479 | -0.32768 | -0.96413 | 2.12480 | 2.13868 | 0.30069 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | -0.86209 | 1.02429 | 0.79488 | -0.36099 | 2.69958 | -1.75784 | -2.07134 | -2.12130 | 1.40779 | 0.23321 | 0.98662 |
| 336 | -0.41431 | -1.26792 | -0.00877 | 2.11683 | -0.61891 | 0.39134 | 0.06241 | -0.04427 | -0.00931 | -0.53900 | -0.34517 |
| 337 | 0.75007 | 0.48589 | -0.17261 | -1.27689 | 0.63303 | 0.91041 | -0.68750 | -0.80159 | -0.70335 | 2.13868 | -2.04846 |
| 338 | -1.00068 | -0.53513 | -0.03225 | -1.08063 | 0.61588 | -0.09333 | -0.58119 | -0.65635 | -0.50817 | -0.19250 | 0.15263 |
| 339 | -0.52119 | 1.19835 | 0.27956 | 0.75116 | 0.03279 | 0.66164 | -0.56186 | -0.83618 | 0.16829 | 2.13868 | 0.38815 |
340 rows × 11 columns
# Write a loop which will plot boxplot and histogram for all variables in subset scaled df
for col in numeric_columns:
box_hist(subset_scaled_df[col], col)
# plotiing pairplot for all the variables to observe any patterns with target variable
sns.set(style="ticks")
sns.set_style("darkgrid")
plt.figure(figsize=(20, 12))
sns.pairplot(subset_scaled_df)
plt.show()
<Figure size 1440x864 with 0 Axes>
# plots heatmap for correlation matrix
plt.figure(figsize=(20, 12))
sns.heatmap(
subset_scaled_df.corr(), annot=True, vmin=-1, vmax=1, cmap="Spectral", fmt=".2f"
)
plt.show()
clusters = range(1, 9)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k)
model.fit(subset_scaled_df)
prediction = model.predict(subset_scaled_df)
distortion = (
sum(
np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
)
/ subset_scaled_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.382318498894466 Number of Clusters: 3 Average Distortion: 2.2967023635156316 Number of Clusters: 4 Average Distortion: 2.179645269703779 Number of Clusters: 5 Average Distortion: 2.1283941117706435 Number of Clusters: 6 Average Distortion: 2.0608145445053134 Number of Clusters: 7 Average Distortion: 1.9847898674793327 Number of Clusters: 8 Average Distortion: 1.9887592433275094
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
clusters = range(1, 9)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k)
model.fit(subset_scaled_df_outrm)
prediction = model.predict(subset_scaled_df_outrm)
distortion = (
sum(
np.min(
cdist(subset_scaled_df_outrm, model.cluster_centers_, "euclidean"),
axis=1,
)
)
/ subset_scaled_df_outrm.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
Number of Clusters: 1 Average Distortion: 3.1482665386211908 Number of Clusters: 2 Average Distortion: 2.9512720112654662 Number of Clusters: 3 Average Distortion: 2.7587853682733137 Number of Clusters: 4 Average Distortion: 2.632234476261218 Number of Clusters: 5 Average Distortion: 2.53551645123123 Number of Clusters: 6 Average Distortion: 2.4620633145362953 Number of Clusters: 7 Average Distortion: 2.3915576127243 Number of Clusters: 8 Average Distortion: 2.3279994516486044
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
For original dataset
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters)
preds = clusterer.fit_predict((subset_scaled_df))
# centers = clusterer.cluster_centers_
score = silhouette_score(subset_scaled_df, preds)
sil_score.append(score)
print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
For n_clusters = 2, silhouette score is 0.4364805522329805 For n_clusters = 3, silhouette score is 0.45494915445064904 For n_clusters = 4, silhouette score is 0.4574398365861574 For n_clusters = 5, silhouette score is 0.4490668565313142 For n_clusters = 6, silhouette score is 0.38768555359623835 For n_clusters = 7, silhouette score is 0.13685994703111565 For n_clusters = 8, silhouette score is 0.3886929719130642 For n_clusters = 9, silhouette score is 0.30629190920677196
[<matplotlib.lines.Line2D at 0x7fcb1c2b75e0>]
For outliers removed dataset
sil_score_outrm = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters)
preds = clusterer.fit_predict((subset_scaled_df_outrm))
# centers = clusterer.cluster_centers_
score = silhouette_score(subset_scaled_df_outrm, preds)
sil_score_outrm.append(score)
print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score_outrm)
For n_clusters = 2, silhouette score is 0.1334867211279908 For n_clusters = 3, silhouette score is 0.18282709918315632 For n_clusters = 4, silhouette score is 0.16921992576864422 For n_clusters = 5, silhouette score is 0.15336308532333096 For n_clusters = 6, silhouette score is 0.13665094400431 For n_clusters = 7, silhouette score is 0.1265267327247148 For n_clusters = 8, silhouette score is 0.1237240596948121 For n_clusters = 9, silhouette score is 0.12332436930789177
[<matplotlib.lines.Line2D at 0x7fcb1e3818e0>]
For K value 7
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
For K value 6
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
For K value 5
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
For K value 7
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(subset_scaled_df_outrm)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
For K value 6
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df_outrm)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
For K value 5
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df_outrm)
visualizer.show()
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
From all above exploration we can finalize K means is performing better with optimal value of K as 5 on the outliers removed scaled dataset
Fitting K means on outliers removed dataset with K as 5
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(subset_scaled_df_outrm)
KMeans(n_clusters=5, random_state=1)
Adding cluster labels to original dataset
# adding kmeans cluster labels to the original dataframe
df["K_means_segments"] = kmeans.labels_
Creating cluster profile with K means clusters
cluster_profile = df.groupby("K_means_segments").mean()
cluster_profile
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | |||||||||||
| 0 | 32.27099 | -14.80780 | 2.79764 | 100.72414 | 41.03448 | -204367103.44828 | -3607376896.55172 | -8.82276 | 581351388.28414 | 75.06039 | 1.01897 |
| 1 | 47.87302 | 4.15488 | 1.42763 | 19.26056 | 46.37324 | -147858471.83099 | 828981295.77465 | 2.48842 | 347116490.60754 | 20.61038 | -1.77668 |
| 2 | 148.93186 | 5.98251 | 1.31276 | 59.38571 | 40.12857 | 13453271.42857 | 988639814.28571 | 6.01771 | 186987714.98943 | 30.76531 | -7.61926 |
| 3 | 128.59447 | 10.73742 | 1.74646 | 25.14634 | 227.29268 | 488585634.14634 | 685962536.58537 | 2.66561 | 579175982.42683 | 70.47230 | 7.72282 |
| 4 | 70.03047 | 6.32774 | 1.23239 | 45.15517 | 67.32759 | 428131034.48276 | 6856208620.68966 | 5.44897 | 1606974958.29483 | 16.23983 | -2.49576 |
Adding counts for each segment
cluster_profile["count_in_each_segment"] = (
df.groupby("K_means_segments")["Current Price"].count().values
)
# let's display cluster profiles
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | ||||||||||||
| 0 | 32.270994 | -14.807804 | 2.797639 | 100.724138 | 41.034483 | -204367103.448276 | -3607376896.551724 | -8.822759 | 581351388.284138 | 75.060395 | 1.018974 | 29 |
| 1 | 47.873015 | 4.154880 | 1.427633 | 19.260563 | 46.373239 | -147858471.830986 | 828981295.774648 | 2.488415 | 347116490.607535 | 20.610382 | -1.776681 | 142 |
| 2 | 148.931857 | 5.982512 | 1.312759 | 59.385714 | 40.128571 | 13453271.428571 | 988639814.285714 | 6.017714 | 186987714.989429 | 30.765310 | -7.619256 | 70 |
| 3 | 128.594471 | 10.737422 | 1.746457 | 25.146341 | 227.292683 | 488585634.146341 | 685962536.585366 | 2.665610 | 579175982.426829 | 70.472296 | 7.722820 | 41 |
| 4 | 70.030465 | 6.327743 | 1.232391 | 45.155172 | 67.327586 | 428131034.482759 | 6856208620.689655 | 5.448966 | 1606974958.294828 | 16.239828 | -2.495758 | 58 |
# Plotting subplots for all numerical columns for each cluster labels
plt.figure(figsize=(20, 15))
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
sns.boxplot(y=df[variable], x=df["K_means_segments"])
plt.tight_layout()
plt.title(variable)
plt.suptitle("Boxplot of numerical variables for each cluster", fontsize=18)
plt.show()
df.groupby("K_means_segments").mean().plot.bar(figsize=(20, 12))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
subset_scaled_df_plt = subset_scaled_df.copy()
subset_scaled_df_plt["K_means_segments"] = kmeans.labels_
subset_scaled_df_plt.groupby("K_means_segments").mean().plot.bar(figsize=(20, 12))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(subset_scaled_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922. Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672. Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404. Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717. Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242. Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367. Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524591. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002. Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159737. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428. Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574. Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818. Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667. Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(subset_scaled_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9232271494002922. Cophenetic correlation for complete linkage is 0.7873280186580672. Cophenetic correlation for average linkage is 0.9422540609560814. Cophenetic correlation for centroid linkage is 0.9314012446828154. Cophenetic correlation for ward linkage is 0.7101180299865353. Cophenetic correlation for weighted linkage is 0.8693784298129404.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(subset_scaled_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
# applying agglomerative clustering on original scaled df
HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(linkage='average', n_clusters=5)
# Coping the dataset to a new dataset for keeing the cluster labels separate for hierarchical clustering
dfhc = df.copy()
dfhc = dfhc.drop(["K_means_segments"], axis=1)
dfhc
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.35000 | 9.99999 | 1.68715 | 135 | 51 | -604000000 | 7610000000 | 11.39000 | 668129938.50000 | 3.71817 | -8.78422 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.24000 | 8.33943 | 2.19789 | 130 | 77 | 51000000 | 5144000000 | 3.15000 | 1633015873.00000 | 18.80635 | -8.75007 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.91000 | 11.30112 | 1.27365 | 21 | 67 | 938000000 | 4423000000 | 2.94000 | 1504421769.00000 | 15.27551 | -0.39417 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.94000 | 13.97720 | 1.35768 | 9 | 180 | -240840000 | 629551000 | 1.26000 | 499643650.80000 | 74.55556 | 4.19965 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.32000 | -1.82786 | 1.70117 | 14 | 272 | 315120000 | 696878000 | 0.31000 | 2247993548.00000 | 178.45161 | 1.05981 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.26000 | 14.88773 | 1.84515 | 15 | 459 | -1032187000 | -4359082000 | -4.64000 | 939457327.60000 | 28.97619 | 6.26177 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.51618 | -8.69892 | 1.47888 | 142 | 27 | 159000000 | 1293000000 | 2.97000 | 435353535.40000 | 17.68221 | -3.83826 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.59000 | 9.34768 | 1.40421 | 1 | 100 | 376000000 | 147000000 | 0.78000 | 188461538.50000 | 131.52564 | -23.88445 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.30000 | -1.15859 | 1.46818 | 4 | 99 | -43623000 | 309471000 | 1.20000 | 257892500.00000 | 22.75000 | -0.06310 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.92000 | 16.67884 | 1.61028 | 32 | 65 | 272000000 | 339000000 | 0.68000 | 498529411.80000 | 70.47059 | 1.72307 |
340 rows × 15 columns
dfhc["HC_Clusters"] = HCmodel.labels_
dfhc
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | HC_Clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.35000 | 9.99999 | 1.68715 | 135 | 51 | -604000000 | 7610000000 | 11.39000 | 668129938.50000 | 3.71817 | -8.78422 | 0 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.24000 | 8.33943 | 2.19789 | 130 | 77 | 51000000 | 5144000000 | 3.15000 | 1633015873.00000 | 18.80635 | -8.75007 | 0 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.91000 | 11.30112 | 1.27365 | 21 | 67 | 938000000 | 4423000000 | 2.94000 | 1504421769.00000 | 15.27551 | -0.39417 | 0 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.94000 | 13.97720 | 1.35768 | 9 | 180 | -240840000 | 629551000 | 1.26000 | 499643650.80000 | 74.55556 | 4.19965 | 0 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.32000 | -1.82786 | 1.70117 | 14 | 272 | 315120000 | 696878000 | 0.31000 | 2247993548.00000 | 178.45161 | 1.05981 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.26000 | 14.88773 | 1.84515 | 15 | 459 | -1032187000 | -4359082000 | -4.64000 | 939457327.60000 | 28.97619 | 6.26177 | 0 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.51618 | -8.69892 | 1.47888 | 142 | 27 | 159000000 | 1293000000 | 2.97000 | 435353535.40000 | 17.68221 | -3.83826 | 0 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.59000 | 9.34768 | 1.40421 | 1 | 100 | 376000000 | 147000000 | 0.78000 | 188461538.50000 | 131.52564 | -23.88445 | 0 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.30000 | -1.15859 | 1.46818 | 4 | 99 | -43623000 | 309471000 | 1.20000 | 257892500.00000 | 22.75000 | -0.06310 | 0 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.92000 | 16.67884 | 1.61028 | 32 | 65 | 272000000 | 339000000 | 0.68000 | 498529411.80000 | 70.47059 | 1.72307 | 0 |
340 rows × 16 columns
cluster_profile_hc = dfhc.groupby("HC_Clusters").mean()
cluster_profile_hc["count_in_each_segments"] = (
dfhc.groupby("HC_Clusters")["Current Price"].count().values
)
# let's see the names of the ticker symbols in each cluster
for cl in dfhc["HC_Clusters"].unique():
print(
"The",
dfhc[dfhc["HC_Clusters"] == cl]["Ticker Symbol"].nunique(),
"Tickers in cluster",
cl,
"are:",
)
print(dfhc[dfhc["HC_Clusters"] == cl]["Ticker Symbol"].unique())
print("-" * 100, "\n")
The 334 Tickers in cluster 0 are: ['AAL' 'ABBV' 'ABT' 'ADBE' 'ADI' 'ADM' 'ADS' 'AEE' 'AEP' 'AFL' 'AIG' 'AIV' 'AIZ' 'AJG' 'AKAM' 'ALB' 'ALK' 'ALL' 'ALLE' 'ALXN' 'AMAT' 'AME' 'AMG' 'AMGN' 'AMP' 'AMT' 'AMZN' 'AN' 'ANTM' 'AON' 'APC' 'APH' 'ARNC' 'ATVI' 'AVB' 'AVGO' 'AWK' 'AXP' 'BA' 'BAX' 'BBT' 'BCR' 'BHI' 'BIIB' 'BK' 'BLL' 'BMY' 'BSX' 'BWA' 'BXP' 'C' 'CAT' 'CB' 'CBG' 'CCI' 'CCL' 'CELG' 'CF' 'CFG' 'CHD' 'CHRW' 'CHTR' 'CI' 'CINF' 'CL' 'CMA' 'CME' 'CMG' 'CMI' 'CMS' 'CNC' 'CNP' 'COF' 'COG' 'COO' 'CSX' 'CTL' 'CTSH' 'CTXS' 'CVS' 'CVX' 'CXO' 'D' 'DAL' 'DD' 'DE' 'DFS' 'DGX' 'DHR' 'DIS' 'DISCA' 'DISCK' 'DLPH' 'DLR' 'DNB' 'DOV' 'DPS' 'DUK' 'DVA' 'DVN' 'EBAY' 'ECL' 'ED' 'EFX' 'EIX' 'EMN' 'EOG' 'EQIX' 'EQR' 'EQT' 'ES' 'ESS' 'ETFC' 'ETN' 'ETR' 'EW' 'EXC' 'EXPD' 'EXPE' 'EXR' 'F' 'FAST' 'FBHS' 'FCX' 'FE' 'FIS' 'FISV' 'FLIR' 'FLR' 'FLS' 'FMC' 'FRT' 'FSLR' 'FTR' 'GD' 'GGP' 'GILD' 'GLW' 'GM' 'GPC' 'GRMN' 'GT' 'GWW' 'HAL' 'HAS' 'HBAN' 'HCA' 'HCN' 'HCP' 'HES' 'HIG' 'HOG' 'HON' 'HPE' 'HPQ' 'HRL' 'HSIC' 'HST' 'HSY' 'HUM' 'IBM' 'IDXX' 'IFF' 'IP' 'IPG' 'IRM' 'ISRG' 'ITW' 'IVZ' 'JBHT' 'JEC' 'JNPR' 'JPM' 'KIM' 'KMB' 'KMI' 'KO' 'KSU' 'LEG' 'LEN' 'LH' 'LKQ' 'LLL' 'LLY' 'LMT' 'LNT' 'LUK' 'LUV' 'LVLT' 'LYB' 'MA' 'MAA' 'MAC' 'MAR' 'MAS' 'MAT' 'MCD' 'MCO' 'MDLZ' 'MET' 'MHK' 'MJN' 'MKC' 'MLM' 'MMC' 'MMM' 'MNST' 'MO' 'MOS' 'MPC' 'MRK' 'MRO' 'MTB' 'MTD' 'MUR' 'MYL' 'NAVI' 'NBL' 'NDAQ' 'NEE' 'NEM' 'NFLX' 'NFX' 'NLSN' 'NOV' 'NSC' 'NTRS' 'NUE' 'NWL' 'O' 'OKE' 'OMC' 'ORLY' 'OXY' 'PBCT' 'PBI' 'PCAR' 'PCG' 'PEG' 'PEP' 'PFE' 'PFG' 'PG' 'PGR' 'PHM' 'PM' 'PNC' 'PNR' 'PNW' 'PPG' 'PPL' 'PRU' 'PSX' 'PWR' 'PX' 'PYPL' 'R' 'RCL' 'REGN' 'RHI' 'ROP' 'RRC' 'RSG' 'SCG' 'SCHW' 'SE' 'SEE' 'SHW' 'SLG' 'SNI' 'SO' 'SPG' 'SPGI' 'SRCL' 'SRE' 'STI' 'STT' 'SWKS' 'SWN' 'SYF' 'SYK' 'T' 'TAP' 'TDC' 'TGNA' 'TMK' 'TMO' 'TRIP' 'TRV' 'TSCO' 'TSN' 'TSO' 'TSS' 'TXN' 'UAA' 'UAL' 'UDR' 'UHS' 'UNH' 'UNM' 'UNP' 'UPS' 'UTX' 'VAR' 'VLO' 'VMC' 'VNO' 'VRSK' 'VRSN' 'VRTX' 'VTR' 'VZ' 'WAT' 'WEC' 'WFC' 'WHR' 'WM' 'WMB' 'WU' 'WY' 'WYN' 'WYNN' 'XEC' 'XEL' 'XL' 'XOM' 'XRAY' 'XRX' 'XYL' 'YHOO' 'YUM' 'ZBH' 'ZION' 'ZTS'] ---------------------------------------------------------------------------------------------------- The 2 Tickers in cluster 2 are: ['APA' 'CHK'] ---------------------------------------------------------------------------------------------------- The 2 Tickers in cluster 1 are: ['BAC' 'INTC'] ---------------------------------------------------------------------------------------------------- The 1 Tickers in cluster 3 are: ['FB'] ---------------------------------------------------------------------------------------------------- The 1 Tickers in cluster 4 are: ['PCLN'] ----------------------------------------------------------------------------------------------------
HCmodel1 = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="ward")
HCmodel1.fit(subset_scaled_df)
AgglomerativeClustering(n_clusters=5)
dfhc["HC_Clusters"] = HCmodel1.labels_
dfhc
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | HC_Clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.35000 | 9.99999 | 1.68715 | 135 | 51 | -604000000 | 7610000000 | 11.39000 | 668129938.50000 | 3.71817 | -8.78422 | 3 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.24000 | 8.33943 | 2.19789 | 130 | 77 | 51000000 | 5144000000 | 3.15000 | 1633015873.00000 | 18.80635 | -8.75007 | 3 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.91000 | 11.30112 | 1.27365 | 21 | 67 | 938000000 | 4423000000 | 2.94000 | 1504421769.00000 | 15.27551 | -0.39417 | 3 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.94000 | 13.97720 | 1.35768 | 9 | 180 | -240840000 | 629551000 | 1.26000 | 499643650.80000 | 74.55556 | 4.19965 | 3 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.32000 | -1.82786 | 1.70117 | 14 | 272 | 315120000 | 696878000 | 0.31000 | 2247993548.00000 | 178.45161 | 1.05981 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.26000 | 14.88773 | 1.84515 | 15 | 459 | -1032187000 | -4359082000 | -4.64000 | 939457327.60000 | 28.97619 | 6.26177 | 0 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.51618 | -8.69892 | 1.47888 | 142 | 27 | 159000000 | 1293000000 | 2.97000 | 435353535.40000 | 17.68221 | -3.83826 | 3 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.59000 | 9.34768 | 1.40421 | 1 | 100 | 376000000 | 147000000 | 0.78000 | 188461538.50000 | 131.52564 | -23.88445 | 3 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.30000 | -1.15859 | 1.46818 | 4 | 99 | -43623000 | 309471000 | 1.20000 | 257892500.00000 | 22.75000 | -0.06310 | 3 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.92000 | 16.67884 | 1.61028 | 32 | 65 | 272000000 | 339000000 | 0.68000 | 498529411.80000 | 70.47059 | 1.72307 | 3 |
340 rows × 16 columns
cluster_profile_hc = dfhc.groupby("HC_Clusters").mean()
cluster_profile_hc["count_in_each_segments"] = (
dfhc.groupby("HC_Clusters")["Current Price"].count().values
)
cluster_profile_hc
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | ||||||||||||
| 0 | 326.19822 | 10.56324 | 1.64256 | 14.40000 | 309.46667 | 288850666.66667 | 864498533.33333 | 7.78533 | 544900261.30133 | 113.09533 | 19.14215 | 15 |
| 1 | 84.35572 | 3.85498 | 1.82767 | 633.57143 | 33.57143 | -568400000.00000 | -4968157142.85714 | -10.84143 | 398169036.44286 | 42.28454 | -11.58950 | 7 |
| 2 | 42.84818 | 6.27045 | 1.12355 | 22.72727 | 71.45455 | 558636363.63636 | 14631272727.27273 | 3.41000 | 4242572567.29091 | 15.24217 | -4.92462 | 11 |
| 3 | 72.76040 | 5.21331 | 1.42708 | 25.60351 | 60.39298 | 79951512.28070 | 1538594322.80702 | 3.65535 | 446472132.22846 | 24.72267 | -2.64719 | 285 |
| 4 | 36.44046 | -16.07341 | 2.83288 | 57.50000 | 42.40909 | -472834090.90909 | -3161045227.27273 | -8.00500 | 514367806.20182 | 85.55568 | 0.83684 | 22 |
# let's see the names of the ticker symbols in each cluster
for cl in dfhc["HC_Clusters"].unique():
print(
"The",
dfhc[dfhc["HC_Clusters"] == cl]["Ticker Symbol"].nunique(),
"Tickers in cluster",
cl,
"are:",
)
print(dfhc[dfhc["HC_Clusters"] == cl]["Ticker Symbol"].unique())
print("-" * 100, "\n")
The 285 Tickers in cluster 3 are: ['AAL' 'ABBV' 'ABT' 'ADBE' 'ADI' 'ADM' 'AEE' 'AEP' 'AFL' 'AIG' 'AIV' 'AIZ' 'AJG' 'AKAM' 'ALB' 'ALK' 'ALL' 'AMAT' 'AME' 'AMG' 'AMP' 'AMT' 'AN' 'ANTM' 'AON' 'APH' 'ARNC' 'ATVI' 'AVB' 'AVGO' 'AWK' 'AXP' 'BA' 'BAX' 'BBT' 'BCR' 'BIIB' 'BK' 'BLL' 'BMY' 'BSX' 'BWA' 'BXP' 'CAT' 'CB' 'CBG' 'CCI' 'CCL' 'CELG' 'CF' 'CFG' 'CHD' 'CHRW' 'CI' 'CINF' 'CMA' 'CME' 'CMI' 'CMS' 'CNC' 'CNP' 'COF' 'COO' 'CSX' 'CTL' 'CTSH' 'CTXS' 'CVS' 'CVX' 'D' 'DAL' 'DD' 'DE' 'DFS' 'DGX' 'DHR' 'DIS' 'DISCA' 'DISCK' 'DLPH' 'DLR' 'DNB' 'DOV' 'DPS' 'DUK' 'DVA' 'EBAY' 'ECL' 'ED' 'EFX' 'EIX' 'EMN' 'EQR' 'EQT' 'ES' 'ESS' 'ETFC' 'ETN' 'ETR' 'EW' 'EXC' 'EXPD' 'EXPE' 'EXR' 'FAST' 'FBHS' 'FE' 'FIS' 'FISV' 'FLIR' 'FLR' 'FLS' 'FMC' 'FRT' 'FSLR' 'GD' 'GGP' 'GILD' 'GLW' 'GM' 'GPC' 'GRMN' 'GT' 'GWW' 'HAL' 'HAS' 'HBAN' 'HCA' 'HCN' 'HCP' 'HIG' 'HOG' 'HON' 'HPE' 'HPQ' 'HRL' 'HSIC' 'HST' 'HSY' 'HUM' 'IBM' 'IDXX' 'IFF' 'IP' 'IPG' 'IRM' 'ITW' 'IVZ' 'JBHT' 'JEC' 'JNPR' 'KIM' 'KSU' 'LEG' 'LEN' 'LH' 'LKQ' 'LLL' 'LLY' 'LMT' 'LNT' 'LUK' 'LUV' 'LVLT' 'LYB' 'MA' 'MAA' 'MAC' 'MAR' 'MAS' 'MAT' 'MCD' 'MCO' 'MDLZ' 'MET' 'MHK' 'MJN' 'MKC' 'MLM' 'MMC' 'MMM' 'MO' 'MOS' 'MPC' 'MRK' 'MTB' 'MTD' 'MYL' 'NAVI' 'NDAQ' 'NEE' 'NEM' 'NLSN' 'NSC' 'NTRS' 'NUE' 'NWL' 'O' 'OMC' 'ORLY' 'PBCT' 'PBI' 'PCAR' 'PCG' 'PEG' 'PEP' 'PFG' 'PG' 'PGR' 'PHM' 'PM' 'PNC' 'PNR' 'PNW' 'PPG' 'PPL' 'PRU' 'PSX' 'PWR' 'PX' 'PYPL' 'R' 'RCL' 'RHI' 'ROP' 'RSG' 'SCG' 'SCHW' 'SEE' 'SHW' 'SLG' 'SNI' 'SO' 'SPG' 'SRCL' 'SRE' 'STI' 'STT' 'SWKS' 'SYF' 'SYK' 'TAP' 'TGNA' 'TMK' 'TMO' 'TRIP' 'TRV' 'TSCO' 'TSN' 'TSO' 'TSS' 'TXN' 'UAA' 'UAL' 'UDR' 'UHS' 'UNH' 'UNM' 'UNP' 'UPS' 'UTX' 'VAR' 'VLO' 'VMC' 'VNO' 'VRSK' 'VRSN' 'VRTX' 'VTR' 'WEC' 'WHR' 'WM' 'WU' 'WY' 'WYN' 'WYNN' 'XEL' 'XL' 'XRAY' 'XRX' 'XYL' 'YUM' 'ZBH' 'ZION' 'ZTS'] ---------------------------------------------------------------------------------------------------- The 15 Tickers in cluster 0 are: ['ADS' 'ALXN' 'AMGN' 'AMZN' 'CMG' 'EQIX' 'FB' 'FTR' 'ISRG' 'MNST' 'NFLX' 'PCLN' 'REGN' 'WAT' 'YHOO'] ---------------------------------------------------------------------------------------------------- The 7 Tickers in cluster 1 are: ['ALLE' 'APA' 'CHK' 'CHTR' 'CL' 'KMB' 'SPGI'] ---------------------------------------------------------------------------------------------------- The 22 Tickers in cluster 4 are: ['APC' 'BHI' 'COG' 'CXO' 'DVN' 'EOG' 'FCX' 'HES' 'KMI' 'MRO' 'MUR' 'NBL' 'NFX' 'NOV' 'OKE' 'OXY' 'RRC' 'SE' 'SWN' 'TDC' 'WMB' 'XEC'] ---------------------------------------------------------------------------------------------------- The 11 Tickers in cluster 2 are: ['BAC' 'C' 'F' 'INTC' 'JPM' 'KO' 'PFE' 'T' 'VZ' 'WFC' 'XOM'] ----------------------------------------------------------------------------------------------------
# lets display cluster profile
cluster_profile_hc.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | ||||||||||||
| 0 | 326.198218 | 10.563242 | 1.642560 | 14.400000 | 309.466667 | 288850666.666667 | 864498533.333333 | 7.785333 | 544900261.301333 | 113.095334 | 19.142151 | 15 |
| 1 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 79951512.280702 | 1538594322.807018 | 3.655351 | 446472132.228456 | 24.722670 | -2.647194 | 285 |
| 4 | 36.440455 | -16.073408 | 2.832884 | 57.500000 | 42.409091 | -472834090.909091 | -3161045227.272727 | -8.005000 | 514367806.201818 | 85.555682 | 0.836839 | 22 |
# Plotting subplots for all numerical columns for each cluster labels
plt.figure(figsize=(20, 15))
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
sns.boxplot(y=dfhc[variable], x=dfhc["HC_Clusters"])
plt.tight_layout()
plt.title(variable)
plt.suptitle("Boxplot of numerical variables for each cluster", fontsize=18)
plt.show()